Jax: Scanner Generator Internals
I've "compressed" the rows of the state table by directly generating code for rows that have only a few different transitions. The trick of generating code directly was one I stole from the neat re2c package. I've used it mostly because it was a quick way to reduce the size of most scanners generated by jax without sacrificing any time in other methods that involve doing multiple table lookups.
This is what it really means. Many scanner generators create a little loop that spins away on a state machine, with some variant of
while (state != ACCEPTING_STATE) state = next(getchar(), state);After table compression, the next() function is usually implemented as a couple of table lookups in arrays.
In jax's variant of this loop, it has a giant switch statement for all the states, and where it seems useful, the computation of the next function is implemented by a set of if statements rather than an array lookup. For instance, with a regular expression that looks like
/ba(na)+/The generated code for the state transitions (lightly edited for readability) looks like
switch(jax_cur_state) {
case 0:
if (jax_next_char == 'n') { jax_cur_state = 4;}
else jax_cur_state = -1;
break;
case 1:
if (jax_next_char == 'b') { jax_cur_state = 2;}
else jax_cur_state = -1;
break;
case 2:
if (jax_next_char == 'a') { jax_cur_state = 3;}
else jax_cur_state = -1;
break;
case 3:
if (jax_next_char == 'n') { jax_cur_state = 4;}
else jax_cur_state = -1;
break;
case 4:
if (jax_next_char == 'a') { jax_cur_state = 0;}
else jax_cur_state = -1;
break;
default:
jax_cur_state = -1;
break;
}
The state machine for this example begins in state 1 incidentally.
This method avoids generating state tables for rows that
have only a few different transitions. In addition, in the
special case a state maps to itself most of the time, the
switch statement is bypassed, and a while loop is generated. For
instance, jax's regular expression to remove single
line comments starting with a #
/#.*/maps into the following code (lightly edited)
case 4:
boolean jax_bool_4 = false;
while ((jax_next_char != 0) && (jax_next_char != '\n'))
{
jax_bool_4 = true;
jax_next_char = jax_advance_char();
}
In "typical" scanners, this allows things like strings and comments to
be quickly slurped up.The second hack is in storing the state transition tables for a row as a string instead of a static array, and then calling the getChars() on the static string at runtime to convert it into an array. This lowers both the size of the class and the time taken to initialize the array. The java compiler creates static arrays by actually generating code to initialize this at runtime, while strings are stored as is in the class file. The getChars() method uses a fast native coded copy operation (System.arraycopy) to perform the copy.
Jax's performance is largely affected by the length of the pattern it is matching. Jax performs better when it is matching large sequences than shorter ones. The reason is that jax has to save some amount of context each time it performs an action, and also as jax accumulates larger and larger data in its expanding input buffers, it gets larger blocks of data each time, which seems to improve the overall input reading speed. To a smaller degree, matching "simple" patterns like strings and keywords kicks in some additional optimizations.
I've compared the performance of jax on three types of specifications, and compared it with just reading one byte at a time from a BufferedInputStream. All inputs are from the /usr/dict/words file, which is about a 206K ascii file.
All four programs used are in the distribution, under the bnchmark directory. The system I used is a sparc 10 running SunOS 5.4 under lightly loaded conditions. All programs were compiled with javac -O and the output is the time returned for the third successive run of each program.
__________________________________________________________________|lurch | |____________________|slurp | |______________________________________|jaxTokenizer | |_________________________________________|buffer
KB Sriram
Comments, bug reports: kbs@sbktech.org
Revised: Sun May 26 09:41:10 1996
URL: http://www.sbktech.org/jax-int.html